Li et al. BMC Proceedings 201 1, 5(Suppl 9):S46 
http://www.biomedcentral.eom/1753-6561/5/S9/S46 



Proceedings 



PROCEEDINGS Open Access 



Large-scale risk prediction applied to Genetic 
Analysis Workshop 17 mini-exome sequence data 

Gengxin Li 1 , John Ferguson 1 , Wei Zheng 2 , Joon Sang Lee 1 , Xianghua Zhang 1,3 , Lun Li 1,4 , Jia Kang 1 , Xiting Yan 2 , 
Hongyu Zhao 1 * 

From Genetic Analysis Workshop 17 
Boston, MA, USA. 13-16 October 2010 



Abstract 

We consider the application of Efron's empirical Bayes classification method to risk prediction in a genome-wide 
association study using the Genetic Analysis Workshop 17 (GAW17) data. A major advantage of using this method 
is that the effect size distribution for the set of possible features is empirically estimated and that all subsequent 
parameter estimation and risk prediction is guided by this distribution. Here, we generalize Efron's method to allow 
for some of the peculiarities of the GAW17 data. In particular, we introduce two ways to extend Efron's model: a 
weighted empirical Bayes model and a joint covariance model that allows the model to properly incorporate the 
annotation information of single-nucleotide polymorphisms (SNPs). In the course of our analysis, we examine 
several aspects of the possible simulation model, including the identity of the most important genes, the differing 
effects of synonymous and nonsynonymous SNPs, and the relative roles of covariates and genes in conferring 
disease risk. Finally, we compare the three methods to each other and to other classifiers (random forest and 
neural network). 



Background 

The development of disease-risk prediction models 
based on genome-wide association data is a great chal- 
lenge to statisticians. A major contributing factor to this 
difficulty is that the observed effects of the most signifi- 
cant features in any particular model are likely to be 
overestimates of their true effects [1]. Because of the 
complexities of a Bayesian analysis with hundreds of 
thousands of features, most of the shrinkage techniques 
that have been proposed to deal with this problem have 
a frequentist flavor, such as the LASSO (least absolute 
shrinkage and selection operator) and ridge regression 
[2]. Although these procedures tend to be computation- 
ally convenient, the resulting shrinkage could be consid- 
ered ad hoc compared with an empirical Bayes 
alternative [3], because for the empirical Bayes alterna- 
tive model shrinkage is guided directly by both the 
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proportion of associated variants and the effect sizes for 
this subset of associated variants. 

Genetic Analysis Workshop 17 (GAW17) provided a 
large-scale mini-exome sequence data set with a high 
proportion of rare variants. In this data set the number of 
genes far exceeds the number of samples, and, as a result, 
finding a good risk prediction model is a difficult chal- 
lenge. Here, we demonstrate the use of an empirical 
Bayes algorithm, originally proposed by Efron [4] in a 
microarray case-control context, that is particular suita- 
ble to this large-scale data setup. This algorithm is a 
modified version of linear discriminant analysis (LDA) in 
which certain parameters, which represent standardized 
differences in the mean expression for case and control 
subjects, are shrunk before they are substituted into the 
LDA rule. In addition to describing some of the subtleties 
that need to be considered when applying Efron's method 
to the GAW17 data (or other genome-wide association 
data), we develop two extensions that allow us to incor- 
porate single-nucleotide polymorphism (SNP) annotation 
information into the prediction rule: the weighted 
empirical Bayes (WEB) model and the joint covariance 



© 201 1 Li et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons 
Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Li et al. BMC Proceedings 201 1, 5(Suppl 9):S46 
http://www.biomedcentral.eom/1753-6561/5/S9/S46 



Page 2 of 9 



(JC) model To show the competitive performance of our 
proposed methods, we compare them with other classi- 
fiers: the random forest and the neural network. 

Methods 

Choice of gene score 

A gene score is a composite value calculated by combining 
all SNP information within the same gene. Several advan- 
tages are gained by applying Efron's empirical Bayes 
method to such gene scores instead of to individual SNPs. 
First, by pooling SNPs together in the correct way, we can 
potentially enrich the signal-to-noise ratio of the data. Sec- 
ond, the dimensionality of the feature space is greatly 
reduced (from 24,487 SNPs to 3,205 gene scores). Finally, 
even though LDA as a technique does not require the fea- 
ture variables to be normal, it is actually an optimal proce- 
dure if they are. Although the number of rare alleles for a 
particular SNP cannot be considered a normal variable, 
applying this assumption to the score for genes that have 
many SNPs may be more reasonable. 

Let Xij denote the Madsen-Browning gene score [5] 
that summarizes SNPs in gene i for individual This 
gene score is calculated as: 



^ [pia-p,)] 



1/2' 



(1) 



where G /; is the number of rare variants for individual 
j at SNP /, K is the number of SNPs within gene /, and 
p l is the empirical minor allele frequency (MAF) at 
SNP /. In practice, the Madsen-Browning method, 
which up-weights SNPs with a lower MAF when calcu- 
lating gene scores, gives more coherent results on the 
GAW17 data, and whole gene scores are calculated 
based on this pooling method. 

Method 1: empirical Bayes method 

We assume that there are n x case subjects and n 2 con- 
trol subjects, where n is the total number of individuals; 
that is, n = n x + n 2 . Suppose that there is no correlation 
between different gene scores; then the LDA rule is to 
classify an individual having N gene scores (X lf X N ) 
as belonging to the disease or case group if: 



(2) 



i<N 



where 



\V2 



n x + n 2 



(a*u -Hi) hi* 



(3) 



and 



(4) 



Here ft itl is the mean score for the ith. gene in the case 
group, {t i>2 is the mean score for the ith gene in the con- 
trol group, and o t is the common standard deviation of 
the interindividual gene score values for gene i in either 
the case or control group. To apply such a method to 
real data, all the parameters in Eq. (2) must be esti- 
mated. If Gi is known, then the Z test statistic: 



Z i =c 0 Xi ' 1 *'' 2 ~N(g„l), 



(5) 



where 



TliTln 



\V2 



(6) 



has expectation S t and is approximately normally dis- 
tributed. A naive application of LDA would assign an 
individual to the disease group if: 



^Z f Wi >0, 



(7) 



i<N 

where 



7 X i -[(Z u +Z,. 1 )/2] 

Wi - . 



(8) 



In practice, one would want to consider only genes 
with the largest Z statistics in application of Eq. (2), 
effectively restricting the range of the sum to the subset 
of the most associated genes. Unfortunately, a large 
selection bias is associated with using the Z statistics 
directly for this subset of genes, because they are most 
likely large overestimates of the true values of d t . How- 
ever, if we can assume that Z t is normally distributed 
with variance 1 (which is true asymptotically no matter 
what the distribution of the original JQ), we can apply 
the empirical Bayes approach to obtain a Bayes estimate 
of Si that will effectively shrink Z t toward zero using an 
empirically estimated prior distribution. These Bayes 
estimates of Si can then be substituted for Z t in Eq. (7) 
to produce a better prediction rule, which assigns an 
individual to the disease group if: 



(9) 



ieS 
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where S is the subset of genes showing the largest 
marginal association with the disease. 

Model 2: weighted empirical Bayes model 

We expected that nonsynonymous SNPs are more likely 
to be directly involved in disease pathogenesis than 
synonymous SNPs. In this section, we propose a method 
to incorporate this annotation information into the 
empirical Bayes model. By fixing gene i, we separately 
consider two gene scores calculated by restricting the 
set of SNPs to contain only synonymous or only nonsy- 
nonymous SNPs. We denote these gene scores as xf 
and X- , respectively. The relative importance of the 
nonsynonymous SNPs compared to the synonymous 
SNPs in gene i can be measured as: 



log(p-,)-log(p il5 ) 



(10) 



where p t \ n and p t \ s are ^-values associated with the /th 
gene score from the nonsynonymous SNPs and the 
synonymous SNPs, respectively. These ^-values were 
calculated by fitting a logistic regression model in which 
the disease trait is regressed on either the synonymous 
or nonsynonymous gene and the Smoke covariate. A 
larger w t implies that the nonsynonymous SNPs from 
the /th gene have a relatively strong association with the 
disease trait compared with the synonymous SNPs. 
Throughout this section, the superscripts n and s refer 
to nonsynonymous and synonymous, respectively. The 
other notation is consistent with that introduced in the 
Model 1 subsection. 

By combining the gene weight with the gene scores 
from both nonsynonymous SNPs and synonymous 
SNPs, we create a new gene score (weighted score): 



X* = w { X n { + (l-u/ i )X J 5 for i = l,...,N. 



(11) 



In this setting, the LDA rule is to classify an individual 
with new measurements (X*,...,X* N ) as belonging to 
the disease group if: 



(12) 



i<N 



where 5- and W* are defined similarly as in the 
Model 1 subsection. 

As before, S* is estimated by shrinking Z* using the 
empirical Bayes method developed by Efron [4]. The 
test statistic: 



i,2 



N(<5*,1) 



(13) 



still follows a normal distribution with expectation §* 
and variance 1; then the application of LDA would 
assign an individual in the same way. 

Model 3: joint covariance model 

The strong linkage disequilibrium (LD) between nonsy- 
nonymous SNPs and synonymous SNPs for any particu- 
lar gene may induce nonsynonymous SNPs and 
synonymous SNPs to be highly correlated. This correla- 
tion may greatly affect the eventual predicting result. 
Building a bivariate model to incorporate nonsynon- 
ymous and synonymous SNP information simulta- 
neously will properly overcome this difficulty. More 
realistically, we can assume that: 



/ Xf X/ v 



■ N 



r n s \ T 



(14) 



where P t is the correlation matrix for ^X",X- j. Now 
define: 



w ; =( w/\w/ ) 

and 



(15) 



(16) 



After some algebra, it follows that the optimal LDA 
rule is to assign an individual to the disease group if: 



w/pr 1 ^ > 0 



(17) 



i<N 



Here we consider {<5f}. and {^/}. to be 
different populations of parameters, and, as a result, the 
associated empirically estimated prior distributions 
should be different. This motivates shrinking the nonsy- 
nonymous and synonymous Z values separately and 
then applying the resulting Bayes estimates into Eq. 
(17). If there is evidence in the data that the nonsynon- 
ymous SNPs are more powerful in distinguishing 
between disease and nondisease, then the synonymous 
SNPs will be shrunk more. This implicitly gives the non- 
synonymous gene scores higher weight in the prediction 
rule. 

Other issues: multiple replicates, treatment of covariates, 
and cross-validation and selection 

One issue that the models need to take into account is 
multiple replicates. The GAW17 data are generated 
from a simulation model that assigns deleterious 
effects to some coding variants within a subset of 
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genes in specific pathways from the 1000 Genomes 
Project [6]. A unique feature of the GAW17 data is 
that a large proportion of rare variants are reliably 
observed in most of the 200 replicates of the data set. 
Thus for any particular gene /, we can define Z statis- 
tics for R replicates {Z a , Z iR }, each of which has an 
N(S it 1) distribution. One can then use: 



Z, 



1 S 



(18) 



r<R 



as a better estimate of S t , However, Z { no longer has 
variance 1. A naive analysis would propose: 



1 
R 



Var(Z I ): 

However, one would expect that: 
p=Cor(Z is ,Z it )>0 



(19) 



(20) 



because there should be a tendency for the sets of 
individuals having the disease phenotype for any two 
different replicates to have significant overlap. Under 
the assumption that: 



p = Cor (Z is ,Z it )\/s,t 



(21) 



for the 5th replicate and tth replicate for the ith 
gene, 



Var(z,.): 



1 + (R-1)P 
R 



(22) 



We can then standardize Z ; appropriately as: 

Z, 



Zi = 



l + (R-l)p 



R 



1/2 • 



(23) 



Note that 



Z, ~ N 



1 + (R-1)P 
R 



V/2 



,1 



(24) 



Because the new variables Z* have variance 1, Efron's 
shrinkage algorithm can be applied directly to 
I zl,...,Z* 32 05 \ • Note that these shrunken Z values are 
the Bayes estir 



Bayes estimates of: 

i S i\ Z i)= r 7 \ 

1 r i + (j? - i)p 

R 



1/2 ' 



(25) 



where we define: 



1 + (R-1)P 
R 



1/2 



(26) 



The values: 



8 i =E[8' i \z i ) 



(27) 



are then substituted into Eq. (9). To estimate p, we 
assume the relationship: 



-M^Z,,-^) 2 =E(s\) = \- Pl (28) 

/ r<R 



which is true under the assumption Cor(Z /5 , Z it ) - p 
for any i, 5, and t and yields the estimate: 



p = \- 



1 

3205 



si 



(29) 



i<3205 



The second issue in the models is the treatment of 
covariates. The covariates available in the GAW17 data 
(i.e., age, sex, and smoking status) have a dominant role 
in conferring disease risk, and it does not make sense to 
shrink these variables. When we allow covariates into 
our prediction rule, the prediction formula becomes: 



w 



age + Z sex Wsex + Z smoke W S moke > 0. 



The last issue we want to mention is cross-validation 
and selection of the best subset of genes. Cross-validation 
is necessary to select the number of genes involved in any 
of the prediction rules to avoid the bias of prediction 
error. Cross-validation is implemented by using 50 repli- 
cates of the GAW17 data as training data, 50 replicates as 
test data, and the other 100 replicates as validation data. 
The Z scores and associated Bayes estimates are calculated 
on the training data. The error is evaluated on the test 
data using the prediction rule for each possible number of 
genes until we have clearly found the prediction rule with 
the minimum cross-validation error. The best prediction 
rule is finally applied to the validation data to find an 
unbiased estimate of the cross-validation error. The opti- 
mal number of genes to use in the prediction rule is calcu- 
lated based on the prediction accuracy on the test data set. 
It should be noted that for the cross-validation we use a 
rule of the form: 



£$,-Wj>log(0.7/0.3)c 0 



(31) 



i<N 
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to account for the imbalance between case and control 
samples in the actual GAW17 data. 

Other classifiers 

To evaluate the competitive performance of our pro- 
posed methods, we also fitted a random forest classi- 
fier [7] and a neural network classifier to the GAW17 
data. The random forest classifier is known to perform 
remarkably well on a large variety of risk prediction 
problems (see [8]) and has been extensively used in 
genomic applications. The comparable performance to 
other classification methods, such as diagonal linear 
discriminant analysis (DLDA), K nearest neighbor 
(KNN) analysis, and support vector machines (SVM), 
has been demonstrated in a microarray study [9], and 
the successful application to a large data set has been 
demonstrated in a genome-wide association study [10]. 
The technique works by fitting a large number of clas- 
sification or regression trees to bootstrapped versions 



of the original data set and then averaging over all 
these trees to form the prediction rule. The neural net- 
work classifier is another efficient learning method and 
has been widely used in many fields, especially risk 
prediction [8]. 

Results 

Table 1 displays the 10 most important variables that 
were found using the empirical Bayes (EB), weighted 
empirical Bayes (WEB), and joint covariance (JC) 
methods. It is clear that the environmental variables 
Age and Smoke have extremely strong signals and 
dominate the resultant models whenever they are 
included. In addition, the gene FLT1 expresses a 
strong association with the disease trait and is found 
in the gene list for these three methods. We also 
detected another gene, C10ORF107, that is near to the 
true causal gene SIRT1. If we extend the gene list to 
the 30 most highly associated genes, PIK3C2B, another 



Table 1 Prediction rule of three proposed methods 



Feature 


Empirical Bayes method 


Weighted empirical Bayes method 




Joint covariance model 




Genes 


#SNP 


MAF 


Genes #Syn SNP 


#Non SNP 


MAF 


Genes 


#Syn SNP 


#Non SNP 


MAF 


1 


Age 






Age 






Age 








2 


Smoke 






Smoke 






Smoke 








3 


ATP11A 


1 


0.29 


SUSD2 13 


23 


<0.01 


ATP11A 


1 




0.29 










2 


4 


0.01-0.05 


















1 


2 


>0.05 










4 


FLT1 


25 


<0.01 


FLT1 8 


17 


<0.01 


BUD13 


1 




0.11 






7 


0.01-0.05 


5 


2 


0.01-0.05 














3 


>0.05 


2 


1 


>0.05 










5 


SUSD2 


36 
6 




ATP11A 1 




0.29 


C10ORFW7 


1 




0.13 


6 


BUD13 


3 
1 


0.11 


RIPK3 4 


13 


<0.01 


RIPK3 


4 


13 


<0.01 










1 


1 


0.01-0.05 




1 


1 


0.01-0.05 










1 


1 


>0.05 




1 


1 


>0.05 


7 


RIPK3 


17 


<0.01 


BUD13 1 




0.11 


SUSD2 


13 


23 


<0.01 






2 


0.01-0.05 










2 


4 


0.01-0.05 






2 


>0.05 










1 


2 


>0.05 


8 


C10ORF107 


1 


0.13 


ADAMTS4 10 


23 


<0.01 


FLT1 


8 


17 


<0.01 










2 


2 


0.01-0.05 




5 


2 


0.01-0.05 










1 


2 


>0.05 




2 


1 


>0.05 


9 


ADAMTS4 


33 


<0.01 


WNT16 8 


7 


< 0.01 


GPR158 




1 


0.1 






4 


0.01-0.05 


1 


2 


0.01-0.05 














3 


>0.05 




2 


>0.05 










10 


MAP3K12 


14 


<0.01 


GOLGA1 1 




<0.01 


ANAPC5 


14 


12 


<0.01 






3 


0.01-0.05 




1 


0.01-0.05 




1 




0.01-0.05 












1 


>0.05 








>0.05 



Top 10 important features from the model incorporating genes and environmental variables for the three proposed methods. #SNP, number of SNPs within a 
specific gene; #Syn SNP, number of synonymous SNPs; #Non SNP, number of nonsynonymous SNPs. MAF shows three intervals of minor allele frequency: 
MAF < 0.01, 0.01 < MAF < 0.05, and MAF > 0.05. The boldfaced genes and environmental variables are real causal features that are selected across the three 
proposed models. 
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true causal gene, is involved in the prediction rule. 
Under the simulation design for the GAW17 data set, 
if a large proportion of rare variants are involved in 
this data set, then we need to record the number of 
SNPs and the minor allele frequency (MAF) interval 
of SNPs within these highly significant genes (see 
Table 1). It is obvious that the MAF of most SNPs 
within these selected genes is less than 0.01. Both the 
WEB and JC methods incorporate SNP annotation 
information in the models; the number of SNPs is 
further divided into two groups: the number of synon- 
ymous SNPs and the number of nonsynonymous 
SNPs. When compared with the synonymous SNPs, 
the important genes in Table 1 have a larger propor- 
tion of nonsynonymous rare variants in the WEB and 
JC models. 

The feature selection procedure of the EB method is 
also compared with the random forest (RF) method and 
logistic regression (LR). The comparison results are 



summarized in Table 2. According to the RF classifier, 
10 features with the largest sum importance score are 
selected from separate RF classifiers on each of the 100 
replicates. Under LR, 10 features with the smallest p- 
values are chosen from the 100 replicates. In brief, six 
features in the RF method and 10 features in LR are 
consistent with features in the EB model, and the con- 
cordance rate in feature selection is quite high between 
our proposed methods and other classifiers. 

The comparison results of misclassification error for 
our proposed methods are displayed in Table 3. The 
first row in Table 3 gives the average misclassification 
error obtained from the model derived on the training 
and test data to predict the phenotype values of the 100 
validation replicates (see the earlier discussion of cross- 
validation). Note that this error may depend on which 
100 replicates are chosen. To explore this issue, we ran- 
domly split the 200 replicates into training, test, and 
validation sets five times. This enabled us to compute a 



Table 2 Comparison of the prediction rule between the empirical Bayes and other classifiers 



Feature 


Empirical Bayes method 


Random forest classifier 


Logistic regression 




Genes 


#SNP 


MAF 


Genes 


#SNP 


MAF 


Genes 


#SNP 


MAF 


1 


Age 






Age 






Age 






2 


Smoke 






Smoke 






Smoke 






3 


ATP11A 


1 


0.29 


FLT1 


25 


<0.01 


SUSD2 


36 


<0.01 












7 


0.01-0.05 




6 


0.01-0.05 












3 


>0.05 




3 


>0.05 


4 


FLT1 


25 


<0.01 


SUSD2 


36 


<0.01 


ATP 11 A 


1 


0.29 






7 


0.01-0.05 




6 


0.01-0.05 












3 


>0.05 




3 


>0.05 








5 


SUSD2 


36 




SHD 


10 


< 0.01 


BUD13 


1 


0.11 






6 






1 


0.01-0.05 












3 






2 


>0.05 








6 


BUD13 


1 


0.11 


RIPK3 


17 


<0.01 


RIPK3 


17 


<0.01 












2 


0.01-0.05 




2 


0.01-0.05 












2 


>0.05 




2 


>0.05 


7 


RIPK3 


17 


<0.01 


ADAMTS4 


23 


<0.01 


FLT1 


25 


<0.01 






2 


0.01-0.05 




4 


0.01-0.05 




7 


0.01-0.05 






2 


>0.05 




3 


>0.05 




3 


>0.05 


8 


C10ORF107 


1 


0.13 


CECR1 


8 


<0.01 


MAP3K12 


14 


<0.01 














0.01-0.05 




3 


0.01-0.05 












4 


>0.05 






>0.05 


9 


ADAMTS4 


33 


<0.01 


GOLGA1 


1 


<0.01 


ADAMTS4 


33 


<0.01 






4 


0.01-0.05 




1 


0.01-0.05 




4 


0.01-0.05 






3 


>0.05 




1 


>0.05 




3 


>0.05 


10 


MAP3K12 


14 


<0.01 


C14orf108 


16 


<0.01 


C10ORF107 


1 


0.13 






3 


0.01-0.05 




1 


0.01-0.05 


















2 


>0.05 









The top 10 important features from the model incorporating genes and environmental variables between our proposed method (empirical Bayes) and other 
classifiers (random forest and logistic regression). #SNP, number of SNPs within a specific gene. MAF shows three intervals of minor allele frequency: MAF < 0.01, 
0.01 < MAF < 0.05, and MAF > 0.05. The boldfaced genes are real causal features that are selected simultaneously from the three models; for example, FLT1 is 
observed using the three classifiers. 
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Table 3 Cross-validation error and AUC value for the three methods 



Item 


Model 


Statistic 


Empirical Bayes method 


Weighted empirical Bayes method 


Joint covariance model 


Cross-validation error 


Gene + environment 


Mean 


0.26 


0.24 


0.24 






SE 


0.0020 


0.001 1 


0.0012 


AUC value 


Gene + environment 


Mean 


0.76 


0.80 


0.78 






SE 


0.0102 


0.0015 


0.0148 


AUC value 


Gene 


Mean 


0.60 


0.64 


0.62 






SE 


0.0191 


0.0183 


0.0191 



AUC is the area under the ROC curve when minimizing the cross-validation error. SE, standard error of the cross-validation error and the AUC value. 



Table 4 Comparison of AUC value for the empirical Bayes and other classifiers 



Item 


Model 


Statistics 


Empirical Bayes model 


Random forest classifier 


Neural network 1 Neural network 2 


AUC value 


Gene + environment 


Mean 


0.76 


0.67 


0.68 0.70 






SE 


0.0102 







AUC value indicates the area under the ROC curve when minimizing the cross-validation error. Neural network 1 used selected features from the logistic 
regression; neural network 2 used selected features from the empirical Bayes method. SE is the standard error of the AUC value. 



standard error of the mean prediction error for the EB, 
WEB, and JC methods (see Table 3). Note that the dif- 
ferences between the means are large relative to the 
standard errors and likely reflect true differences in the 
performance of the three methods. It is clear that the 
WEB method provides the smallest average misclassifi- 
cation error (0.236) followed by the JC method (0.241) 
and the EB method (0.26). 

We also compared the prediction accuracies for our 
proposed methods using the area under curve (AUC) 
value (Table 3). When both genes and environmental 
variables are involved in the prediction model, the WEB 
method gives the highest AUC value (0.80) followed by 
the JC method (0.78) and the EB method (0.76). All 
three methods perform better than other classifiers: RF 
(0.67), neural network 1 (NN1: 0.68), and neural net- 
work 2 (NN2: 0.70) (Table 4). It is easy to see that the 
EB-based neural network classifier (0.70) provides a lar- 
ger AUC value than the LR-based neural network classi- 
fier (0.68). The relevant three receiver operating 
characteristic (ROC) curves corresponding to our pro- 
posed methods are plotted in Figure 1. 

In summary, our proposed methods significantly 
improve the accuracy of the prediction model com- 
pared with other classifiers. Because the environmental 
variables have such a strong influence in the prediction 
model, we also fitted the EB, WEB, and JC models 
using the genetic variables alone in order to determine 
the prediction accuracy achievable through purely 
genetic information (Table 3). In this case, the best 
AUC value is achieved using the WEB method (0.64) 
followed by the JC method (0.62) and the EB method 
(0.60) (Figure 2). 

Of course, in practical applications more than one 
replicate cannot be obtained. This scenario can be 



represented by training and testing the prediction 
model using only one replicate. When one does this, 
the prediction model based on the EB method is still 
quite good. For example, FLT1 is always in the list of 
the 10 most strongly associated features in the EB 
model. If a similar model is fitted using the RF classi- 
fier, no causal genes tend to be found in the top gene 
list (Table 5). In addition, the EB method provides a 




~ i 1 1 1 1 r~ 

0.0 0.2 0.4 0.6 0.8 1.0 

False Positive Rate 



Figure 1 ROC curves for the EB, WEB, and JC methods for the 
prediction model using genes and environmental covariates. 

The black dotted line is the ROC curve generated from gene and 
environmental covariates in the prediction model based on the 
empirical Bayes (EB) method. The blue solid line is the ROC curve 
from the weighted empirical Bayes (WEB) model. The purple dot- 
dashed line is the ROC curve from the joint covariance (JM) model. 

The red dashed line is the diagonal, 
k J 
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False Positive Rate 

Figure 2 ROC curves for the EB, WEB and JC methods for the 
prediction model using genes only. The black dotted line is the 
ROC curve generated from the prediction model using genes only, 
based on the empirical Bayes (EB) method. The blue solid line is the 
ROC curve from the weighted empirical Bayes (WEB) model. The 
purple dot-dashed line is the ROC curve from the joint covariance 
(JC) model. The red dashed line is the diagonal. 



Table 5 Prediction rule for two classifiers based on one 
replicate 



substantively larger AUC value (0.72) than the RF clas- 
sifier (0.66) (Table 6). 

Conclusions 

It is well known that developing a good disease risk pre- 
diction model based on genome-wide association data is 
a difficult task; the number of predictors can be orders 
of magnitude higher than the number of samples that 
are genotyped. This is certainly the case in the GAW17 
mini-exome data set, in which there is information on 
24,487 SNPs for only 697 samples. In this paper, we 
have used the good properties of the empirical Bayes 
prediction model that Efron [4] developed in a large- 
scale microarray context to build a prediction model for 
these data. An interesting feature of the GAW17 data is 
that they provide annotation information for each SNP 
in the form of a synonymous /nonsynonymous indicator. 
Because only nonsynonymous SNPs affect protein func- 
tion, we expect that they, rather than synonymous 
SNPs, are more likely to be directly involved in disease 
pathogenesis. We propose two ways (weighted empirical 
Bayes model and joint covariance model) to properly 
incorporate this annotation information into the predic- 
tion model. The weighted empirical Bayes model pro- 
vides the best performance (relatively small cross- 
validation error and larger AUC value). We also com- 
pare the three EB classifiers with two other popular clas- 
sifiers (random forest and neural network). The EB 



Feature 


Empirical Bayes classifier 


Random forest classifier 


Genes 


#SNP 


MAF 


Genes 


#SNP 


MAF 


1 


Age 






Age 






2 


Smoke 






Smoke 






3 


GOLGA 1 


1 


<0.01 


OR] 16 




<0.01 






1 


0 01-0 05 




3 


001-0 05 






1 


>0.05 




1 


>0.05 


4 


FLT1 


25 


<0.01 


VTI1B 


9 


<0.01 






7 


0 01-0 05 




1 


001-0 05 






3 


>0.05 




1 


>0.05 


5 


NFKRIA 

1 V / / \U l/\ 


6 


<0.01 


LSI— 1 V/ V Ly 1 i\ 


19 


<0.01 








001-0 05 




3 


001-0 05 






2 


>0.05 




4 


>0.05 


6 


DGKZ 


1 7 


<0.01 


C90RF66 

V— yV-Si 1 / \J\J 


4 


<0.01 






4 


001-0 05 




3 


001-0 05 






1 


>0.05 




4 


>0.05 


7 


SMTN 


23 


<0.01 


CECR1 


8 


<0.01 






4 


0.U l-O.Ub 






0.U l-O.Ub 






2 


>0.05 




4 


>0.05 


8 


PAK7 


1 


0.30 


MAP3K12 


14 


<0.01 












3 


0.01-0.05 














>0.05 


9 


AD AMI 5 


22 


<0.01 


SLC20A2 


24 


<0.01 






5 


0.01-0.05 




4 


0.01-0.05 






3 


>0.05 




1 


>0.05 


10 


ADAMTS4 


33 


<0.01 


ALK 


9 


<0.01 






4 


0.01-0.05 




1 


0.01-0.05 






3 


>0.05 




6 


>0.05 



Top 10 important features from the model incorporating genes and 
environmental variables (Age and Smoke) using one replicate between our 
proposed method (empirical Bayes) and the random forest method. #SNP, 
number of SNPs within a specific gene. MAF shows three intervals of minor 
allele frequency: MAF < 0.01, 0.01 < MAF < 0.05, and MAF > 0.05. The 
boldfaced gene FLT1 still can be selected in the empirical Bayes method but 
is not observed using the random forest method. 



Table 6 Cross-validation error and AUC value for the 
empirical Bayes and random forest methods based on 
one replicate 



Item 


Model 


Statistics 


Empirical 
Bayes 
method 


Random 
forest 
method 


Cross-validation 


Gene + 


Mean 


0.26 


0.23 


error 


environment 












SE 


0.009 




AUC value 


Gene + 


Mean 


0.72 


0.66 




environment 












SE 


0.058 





AUC value is the area under the ROC curve when minimizing the cross- 
validation error. SE is the standard error of the cross-validation error and the 
AUC value. 
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classifiers have superior prediction performance in terms 
of AUC value. Based on this analysis, we think that 
Efrons empirical Bayes risk prediction model, extended 
in the manner that we describe here, is a useful and 
powerful tool for disease risk prediction in genome-wide 
association studies. 
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